The changing heavens have played a central role in the scientific effort
of astronomers for centuries. Galileo's synoptic observations of the moons of 
Jupiter and the phases of Venus, starting in 1610, provided strong refutation
of Ptolemaic cosmology. These observations came soon after the discovery of 
Kepler's supernova had challenged the notion of an unchanging firmament. 
In more modern times, the discovery of a relationship between period and
luminosity in some pulsational variable stars [40] led to the inference of
the size of the Milky Way, the distance scale to the nearest galaxies, and
the expansion of the Universe (see [30] for review). Distant explosions of
supernovae were used to uncover the existence of dark energy and provide 
a precise numerical account of dark matter (e.g., [3]). Repeat observations 
of pulsars [67] and nearby main-sequence stars revealed the presence of the 
first extrasolar planets [43], [42], [35], E7J. Indeed, time-domain observations
of transient events and variable stars, as a technique, influence a broad
diversity of pursuits in the entire astronomy endeavor [65].

While, at a fundamental level, the nature of the scientific pursuit remains 
unchanged, the advent of astronomy as a data-driven discipline presents fun- 
damental challenges to the way in which the scientific process must now be 
conducted. Digital images (and data cubes) are not only getting larger, there 
are more of them. On logistical grounds, this taxes storage and transport 
systems. But it also implies that the intimate connection that astronomers 
have always enjoyed with their data (from collection to processing to
analysis to inference) necessarily must evolve. Figure 1.1 highlights some of
the ways that the pathway to scientific inference is now influenced (if not
driven) by modern automation processes, computing, data-mining, and machine
learning.

The emerging reliance on computation and machine learning (ML) is a
general one (a central theme of this book), but the time-domain aspect of
the data and the objects of interest presents some unique challenges. First,
any collection, storage, transport, and computational framework for
processing the streaming data must be able to keep up with the dataflow. This
is not necessarily true, for instance, with static-sky science, where metrics
of interest can be computed off-line and on a timescale much longer than the
time required to obtain the data. Second, many types of transient (one-off)
events evolve quickly in time and require more observations to fully
understand the nature of the events. This demands that time-changing events
be quickly discovered, classified, and broadcast to other followup facilities.
All of this must happen robustly with, in some cases, very limited data.
Last, the process of discovery and classification must be calibrated to the
available resources for computation and followup. That is, the precision of
classification must be weighed against the computational cost of producing
that level of precision. Likewise, the cost of being wrong about the
classification of some sorts of sources must be balanced against the
scientific gains from being right about the classification of other types of
sources. Quantifying these tradeoffs, especially in the presence of a limited
amount of followup resources (such as the availability of larger-telescope
observations), is not straightforward and involves domain-specific imperatives
that will, in general, differ from astronomer to astronomer.

Figure 1.1: Data mining, computation, and ML roles in the scientific pathway.
[Figure 1.1 graphic not reproduced. Legible stage labels: Transport & Storage;
Processing; Discovery; Classification; Followup; Inference. Legible
annotations: high-performance computing; optimal image compression;
distributed databases; ML on images; fast feature generation; ML on time &
context features; followup with robotic telescopes.]

This chapter presents an overview of the current directions in machine
learning and data-mining techniques in the context of time-domain
astronomy. Ultimately the goal (if not just the necessity, given the data
rates and the diversity of questions to be answered) is to abstract the
traditional role of the astronomer in the entire scientific process. In some
sense, this takes us full circle, back to the pre-modern view of the
scientific pursuit presented in Vermeer's "The Astronomer" (Figure 1.2): in
broad daylight, he contemplates the nighttime heavens from depictions
presented to him on a globe, based on observations that others have made. He
is an abstract thinker, far removed from data collection and processing; his
most visceral connection to the skies is just the feel of the orb under his
fingers. Replace the globe with a plot on a screen generated from an SQL query
to a massive public database in the cloud, and we have a picture of the modern
astronomer benefitting from the ML and data-mining tools operating on an
almost unfathomable amount of raw data.



1.1 Discovery 

We take the notion of discovery, in the context of the time domain, as the
recognition that data collected (e.g., a series of images of the sky)
contains a source which is changing in time in some way. Classification (§1.2)
is the quantification of the similarity of that source to other known types
of variability and, by extension, the inference of why that source is
changing. The most obvious change to discover is that of brightness or flux.
On imaging data, changes in color and position might also be observed
(footnote 1). Spectroscopically, changes in emission/absorption properties and
apparent velocities might also be sought. Discovery of time-variable behavior
is technique-specific and, as such, we will review the relevant regimes. Yip
et al. [70] discuss variability discovery on spectroscopy line features in the
context of active galactic nuclei. Gregory [33] presents ML-based discovery
and characterization algorithms for astrometric- and Doppler-based data in the
context of exoplanets. We focus here on the discovery of brightness/flux
variability.

Figure 1.2: "The Astronomer," by Johannes Vermeer, c. 1668. In many ways, the
epoch of the armchair astronomer is returning to primacy.

1 Discovery of change in position, especially for fast-moving sources (such as
asteroids), carries its own set of data-mining challenges which we will
discuss. See, for example, [37i m.

1.1.1 Identifying Candidates

Pixelated Imaging: Many new and planned wide-field surveys are drawing 
attention to the need for data-mining and ML. These surveys will generate 
repeated images of the sky in some optical or infrared bandpass. These
2-dimensional digitized images form the basic input to discovery (footnote 2).
The data
from such surveys are usually obtained in a "background-limited" regime, 
meaning that the signal-to-noise on an exposure is dominated by the flux 
of sources (as the signal) and the background sky brightness (as the domi- 
nant noise component). Except in the most crowded images of the plane of 
the Milky Way, most pixels in the processed images contain only sky flux. 
Less than a few percent of pixels usually contain significant flux from stars, 
galaxies or other astrophysical nebulosities. 

There are two broad methods for discovering variability in such images. 
In one, all sources above some statistical threshold of the background noise 
are found and the position and flux associated with those sources are ex- 
tracted to a catalog. There are off-the-shelf codebases to do this (e.g., [61. [31] ) 
but such detection and extraction on images is by no means straightforward 
nor particularly rigorous, especially near the sky-noise floor of images.
Variability is then found by asking statistical questions (see §1.1.2) about
the constancy (or otherwise) of the light curve produced for a given source,
created by cross-correlating sources by their catalog position across different 
epochs [15]. The other method, called "image differencing," [6l" l [T| l68| [12] 
takes a new image and subtracts away a "reference image" of the same por- 
tion of the sky; this reference image is generally a sharp, high signal-to-noise 
composite of many historical images taken with the same instrumental setup 
and is meant to represent an account of the "static" (unchanging) sky. 
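
The image-differencing idea can be illustrated with a minimal sketch. It assumes the new and reference images are already astrometrically aligned and PSF-matched, so the alignment and convolution steps are omitted. The function name `difference_candidates`, the toy pixel values, the noise levels, and the 5-sigma threshold are illustrative assumptions and are not taken from any survey pipeline.

```python
import math

def difference_candidates(new, ref, new_sigma, ref_sigma, nsigma=5.0):
    """Return (row, col, diff) for pixels whose |new - ref| exceeds a threshold.

    Assumes `new` and `ref` are aligned, PSF-matched 2-D lists of equal shape
    with independent Gaussian pixel noise (a simplification).
    """
    # Noise of a difference of two independent images adds in quadrature.
    diff_sigma = math.sqrt(new_sigma ** 2 + ref_sigma ** 2)
    hits = []
    for r, (row_new, row_ref) in enumerate(zip(new, ref)):
        for c, (p_new, p_ref) in enumerate(zip(row_new, row_ref)):
            d = p_new - p_ref
            if abs(d) > nsigma * diff_sigma:
                hits.append((r, c, d))
    return hits

# Toy 3x3 images: a static star at (1, 1) and a new source at (2, 2).
ref = [[100.0, 101.0,  99.0],
       [100.0, 250.0, 100.0],
       [ 98.0, 100.0, 102.0]]
new = [[101.0,  99.0, 100.0],
       [100.0, 252.0, 100.0],
       [ 99.0, 100.0, 180.0]]

# The reference is a deep composite, so its noise is taken to be smaller.
candidates = difference_candidates(new, ref, new_sigma=2.0, ref_sigma=0.5)
print(candidates)
```

The static star subtracts away while the new source survives the threshold. In real data, imperfect alignment and PSF matching leave residuals around bright stars, which is one origin of the spurious candidates discussed in this section.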

Both methods have their relative advantages and drawbacks (see [24]
for a discussion). Since image differencing involves astrometric alignment
and image convolution, catalog-based searches are generally considered to
be faster. Moreover, catalog searches tend to produce fewer spuriously
detected sources because the processed individual images tend to have fewer
"defects" than differenced images. Catalog searches perform poorly,
however, in crowded stellar fields (where aperture photometry is difficult)
and in regions around galaxies (where new point sources embedded in galaxy
light can be easily outshone). Given the intellectual interests in finding
variables in crowded fields (e.g., microlensing; [48], [9]) and transient
events (such as supernovae and novae) near galaxies, image-difference based
discovery is considered necessary for modern surveys.

2 In each one-minute exposure, for example, the Palomar Transient Factory [38]
produces 11 images, each from a 2k x 4k CCD array with pixels of size 1 sq.
arcsecond (0.65 sq. degree per image). Since each pixel is 2 bytes, this
amounts to 184 MB of raw data generated per minute. Raw data are pre-processed
using calibration data to correct for variable gain and illumination across
the arrays; spatially-dependent defects in the arrays are flagged and such
pixels are excluded from further scrutiny.
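
The raw-data volume and field size quoted in the footnote on the Palomar Transient Factory can be checked with a few lines of arithmetic. The sketch below assumes that "2k x 4k" means 2048 x 4096 pixels and that 1 MB is 10^6 bytes; the variable names are illustrative.

```python
# Values quoted in the footnote on PTF raw-data volume.
n_ccds = 11                 # images per one-minute exposure
nx, ny = 2048, 4096         # "2k x 4k" pixels per CCD (assumed powers of two)
bytes_per_pixel = 2
pixel_area_arcsec2 = 1.0    # each pixel covers about 1 sq. arcsecond

pixels_per_ccd = nx * ny
raw_bytes_per_exposure = n_ccds * pixels_per_ccd * bytes_per_pixel
raw_mb_per_exposure = raw_bytes_per_exposure / 1e6

# 3600 arcsec per degree, so 3600**2 sq. arcsec per sq. degree.
area_deg2_per_ccd = pixels_per_ccd * pixel_area_arcsec2 / 3600.0 ** 2

print(f"{raw_mb_per_exposure:.1f} MB per exposure")
print(f"{area_deg2_per_ccd:.2f} sq. degree per CCD image")
```

Under these assumptions the numbers come out at about 184.5 MB per exposure and 0.65 sq. degree per CCD image, consistent with the quoted values.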

Computational costs aside, one of the primary difficulties with image
differencing is the potential for a high ratio of spurious candidate events
to truly astrophysical events. A trained human scanner can often discern
good and bad subtractions and, for many highly successful projects, human
scanners were routinely used for determining promising discovery
candidates. The KAIT supernova search [29] makes use of undergraduate
scanners to sift through ~1000 images from the previous night. Over 1000 SNe
were discovered in 10 years of operations with this methodology [39]. Basic
quality/threshold cuts on the metrics about each candidate can be used to
present to human scanners a smaller subset of images for inspection; in this
way, the Sloan Digital Sky Survey II Supernova Search [33] netted >300
spectroscopically confirmed supernova discoveries from ~150,000 manually
scanned candidates. The Nearby Supernova Factory [2], after years of using
threshold cuts, began to use boosted decision trees (§1.2.2) on metrics from
image differences to optimize supernova discovery [4]. Unlike
domain-focused discovery surveys (such as supernova searches), many surveys
are concerned with discovery and classification of all sorts of variable stars
and transients. So, unlike the supernova discovery classifier of Bailey et
al. [4] (which was highly tuned to finding transient events near galaxies),
discovery techniques must aim to be agnostic to the physical origin of the
source of variability. That is, there is an imperative to separate the notion
of "discovery" from that of "physical classification."

In the Palomar Transient Factory, we find at least one hundred
high-significance bogus candidates for every one real candidate in image
differences [8]. With over one million candidates produced nightly, human
vetting of every image is infeasible. Instead, we produced a training set of
human-vetted candidates, each with dozens of measured features (such as FWHM,
ellipticity; see [8]). These candidates are scored on a scale from 0 to 1
based on their inferred likelihood of being bogus or astrophysically "real."
We developed a random forest classifier on the features to predict the 1-0
real-bogus value and saved the result of the ML classifier on each candidate.
These results are used to make discovery decisions in PTF. After one year of
the survey, we also created training sets of real and bogus candidates by
using candidates associated with known/confirmed transients and variables
[45]. Figure 1.3 shows the "receiver operating characteristic" (ROC) curve for
a random forest classifier making use of the year-one training sample.

Figure 1.3: ROC curve for image-differenced candidates based on a training
sample of 50,000 candidates from the Palomar Transient Factory. High
efficiency and high purity are on the bottom left of the plot. From [45].
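
How an ROC curve is traced out from real-bogus scores can be sketched as follows. This is a generic construction, not the PTF code: the function name `roc_points` and the toy scores and labels are invented for illustration, and the axes here (false-positive rate against true-positive rate) may differ from the convention used in Figure 1.3, whose caption implies the best performance lies at the bottom left.

```python
def roc_points(scores, labels):
    """Return a list of (threshold, false_positive_rate, true_positive_rate).

    scores: classifier outputs in [0, 1], higher meaning "more likely real".
    labels: 1 for real, 0 for bogus.
    A candidate is accepted when its score is >= the threshold.
    """
    n_real = sum(labels)
    n_bogus = len(labels) - n_real
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_bogus, tp / n_real))
    return points

# Toy training sample (illustrative values only).
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for thr, fpr, tpr in roc_points(scores, labels):
    print(f"threshold={thr:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold accepts more real candidates (higher efficiency) at the price of admitting more bogus ones.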

If all real sources occurred at just one epoch of observation, then ROC
curves such as those depicted in Figure 1.3 would directly reflect discovery
capabilities: the false-negative rate (type II error) would set the
efficiency for discovery and the false-positive rate (type I error) would set
the purity for discovery. However, most transient events occur over several
epochs and bogus candidates often do not recur at precisely the same location.
Therefore, turning candidate-level ROC curves into global discovery
efficiency/purity quantities is not straightforward. In PTF we require 2
high-quality ML-score candidates within a 12-day window to qualify a certain
position on the sky as a discovery of a true astrophysical source (footnote
3). In the first 8 months of producing automatic discoveries with PTF, our
codebase independently discovered over 10,000 transients and variable stars.
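
A position-level discovery rule of the kind described above can be sketched as follows. The 12-day window comes from the text and the 45-minute separation from the accompanying footnote; the score cut `min_score=0.5`, the function name `is_discovery`, and the choice to apply both conditions to the same pair of detections are assumptions made for illustration.

```python
def is_discovery(detections, min_score=0.5, window_days=12.0,
                 min_sep_minutes=45.0):
    """Decide whether one sky position qualifies as a discovery.

    detections: list of (time_in_days, real_bogus_score) at that position.
    Returns True if two detections with score >= min_score lie within
    window_days of each other and are more than min_sep_minutes apart
    (the separation helps reject moving asteroids).
    """
    good = sorted(t for t, score in detections if score >= min_score)
    min_sep_days = min_sep_minutes / (24.0 * 60.0)
    for i, t1 in enumerate(good):
        for t2 in good[i + 1:]:
            dt = t2 - t1
            if dt > window_days:
                break  # later detections are even farther away in time
            if dt > min_sep_days:
                return True
    return False

print(is_discovery([(0.0, 0.9), (3.0, 0.7)]))    # two good epochs, 3 days apart
print(is_discovery([(0.0, 0.9), (0.02, 0.8)]))   # only about 29 minutes apart
print(is_discovery([(0.0, 0.9), (20.0, 0.9)]))   # outside the 12-day window
```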

3 This discovery criterion is designed to find fast-changing events, of
particular interest to the PTF collaboration. We also require at least two
observations separated by more than 45 minutes in time, to help remove moving
asteroids from the discovery set.

Radio Interferometry: Traditionally, radio images are generated from
raw u-v interferometric data using a human-intensive process to iteratively
flag and remove spurious baseline data. Phase drift due to the ionosphere,
instrumental instability, and terrestrial radio-frequency interference (RFI)
are all impediments to automatically producing clean images of the sky. Given
the massive data rates soon expected from wide-field surveys (e.g., the LOw
Frequency ARray, LOFAR; the Australian Square Kilometre Array Pathfinder,
ASKAP), there is a pressing need to autonomously produce clean images
of the radio sky. Algorithmic innovations to speed automatic image
creation have been impressive (e.g., [47]). For RFI mitigation, a genetic
algorithm approach has produced promising results [32]. Once images are made,
sources are detected much the same way as with optical imaging (footnote 4),
and catalog
searches are used to find transients and variables [TTJ [TU] . 

1.1.2 Detection and Analysis of Variability 

For catalog-based searches, variability is determined on the basis of the
collection of flux measurements as a function of time for a candidate source.
Since variability can be manifested in many ways (such as aperiodic
behavior, occasional eclipsing, etc.), one single metric on variability will
not suffice
in capturing variability [62], [25] [TU [60], [23] . A series of statistical questions 
can be asked of each light curve with each new epoch. Are the data
consistent with an unchanging flux, in a χ² sense? Are there statistically
significant deviant data points? How are those outliers clustered in time?
Significant variability of periodic sources may be revealed by direct
periodogram analysis ([58]; see also ref. [5]). In the Poisson detection
limit, such as at γ-ray wavebands or with detections of high-energy neutrinos,
discovering variability in a source is akin to asking the question of whether
there is a statistically significant change in the rate of arrival of
individual events; for this, there are sophisticated tools (such as Bayesian
blocks) for analysis [57], [63]. One of the important real-world
considerations is that photometric uncertainty estimates are always just
estimates, based on statistical sampling of individual image characteristics.
Systematic errors in this uncertainty (either too high or too low) can
severely bias variability metrics (cf. [25]). Characterizing efficiency-purity
from systematic errors must be done on a survey-by-survey basis.
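
The first two statistical questions above (consistency with an unchanging flux in a χ² sense, and the presence of deviant points) can be written down directly. This is a minimal sketch: the function names, the toy light curves, and the 3-sigma outlier cut are illustrative assumptions, and the result is only as trustworthy as the quoted photometric uncertainties.

```python
import statistics

def constant_flux_chi2(fluxes, sigmas):
    """Return (weighted_mean, chi2, dof) for the constant-flux hypothesis."""
    weights = [1.0 / s ** 2 for s in sigmas]
    mean = sum(w * f for w, f in zip(weights, fluxes)) / sum(weights)
    chi2 = sum(((f - mean) / s) ** 2 for f, s in zip(fluxes, sigmas))
    dof = len(fluxes) - 1  # one fitted parameter (the mean)
    return mean, chi2, dof

def outlier_epochs(fluxes, sigmas, nsigma=3.0):
    """Indices of epochs deviating from the median by more than nsigma*sigma."""
    med = statistics.median(fluxes)  # median is robust to the outliers sought
    return [i for i, (f, s) in enumerate(zip(fluxes, sigmas))
            if abs(f - med) > nsigma * s]

sigmas = [0.1] * 5
quiet = [10.0, 10.1, 9.9, 10.0, 10.0]
flaring = [10.0, 10.1, 9.9, 12.0, 10.0]

for name, lc in (("quiet", quiet), ("flaring", flaring)):
    mean, chi2, dof = constant_flux_chi2(lc, sigmas)
    print(name, f"reduced chi2 = {chi2 / dof:.1f}",
          "outliers:", outlier_epochs(lc, sigmas))
```

A reduced χ² near or below one is consistent with a constant source, while a value far above one signals variability, provided the uncertainties are not systematically underestimated.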



1.2 Classification 

Determining the physical origin of variability is the basic impetus of
classification. But clearly what is observed and what is inferred to underlie
that which is observed are not the same, the latter deriving from potentially
several interconnected and complex physical processes. A purely physically
based classification schema is then reliant upon subjective and potentially
incorrect model interpretation. For instance, to say that the origin of
variability is due to an eclipse requires an intuitive leap, however
physically relevant, from observations of a periodic dip in an otherwise
constant light curve. A purely observationally based classification scheme, on
the other hand, lacks the clarifying simplicity offered by physical
classification. For example, how is a periodic light curve "dipping" (from an
eclipsing system) different, quantitatively, from an extreme example of
periodic brightness changes (say from a pulsational variable)? To this end,
existing classification taxonomies tend to rely on an admixture of
observational and physical statements. And when a variable source is found,
the goal is to find how that source fits within an established taxonomy.

4 Note that McGowan et al. [32] have developed an ML approach to faint source
discovery in radio images.

Phenomenological and theoretical taxonomies aside, the overriding con- 
ceptual challenge of classification is that no two sources in nature are iden- 
tical and so the boundaries between classes (and subclasses) are inherently 
fuzzy: there is no ground truth in classification, regardless of the amount 
and quality of the data. With finite data, the logistical challenge is in ex- 
tracting the most relevant information, mapping that onto the quantifiable 
properties derivable from instances of other variables, and finding an
(abstractly construed) distance to other sources. There are several broad
reasons for classification:

1. Physical Interest: Understanding the physical processes behind the
diversity of variability requires numerous examples across the
taxonomy. Studying the power spectrum of variability in high signal-to-noise
light curves can be used to infer the interior structure of stars
(asteroseismology). Modelling detached eclipsing systems can be used to infer
the mass, radius, and temperatures of the binary components.

2. Utility: Many classes of variables have direct utility in making
astrophysically important measurements that are wholly disconnected from
the origin of the variability itself. Mira, RR Lyrae, and Cepheids are
used for distance ladder measurements, providing probes of the
structure and size of the universe. Calibrated standard-candle
measurements of Ia and IIP supernovae are cosmographic probes of
fundamental parameters. Short-period AM CVn systems serve as a strong source
of "noise" for space-based gravitational wave detectors; finding and
characterizing these systems through optical variability allows the sources
to be effectively cleaned out of the LISA datastream, allowing more
sensitive searches for gravitational waves in the same frequency band.

CHAPTER 1. BLOOM & RICHARDS

3. Demographics: Accounting for various biases, the demographics from
classification of a large number of variable stars can be used to form 
and understand the evolutionary life-cycle of stars across mass and 
metallicity. Understanding the various ways in which high mass stars 
die can be gleaned from the demographics of supernova sub-types. 

4. Rarities and Anomalies: Finding extreme examples of objects from
known classes or new examples of sparsely populated classes has the 
potential to inform the understanding of (the more mundane) simi- 
lar objects. The ability to identify anomalous systems and discover 
new types of variables — either hypothesized theoretically or not — is 
likewise an important feature of any classification system. 

Expert-based (human) classification has been the traditional approach 
to time-series classification: a light-curve (and colors, and position on the 
sky, etc.) is examined and a judgement is made about class membership. 
The preponderance of peculiar outliers of one (historical) class may lead 
to a consensus that a new sub-class is warranted (footnote 5). Again, with surveys of
hundreds of thousands to billions of stars and transients, this traditional 
role must necessarily be replaced by ML and other data-mining techniques. 



1.2.1 Domain-based Classification 

Some of the most fruitful modern approaches to classification involve
domain-specific classification: using theoretical and/or empirical models of
certain classes of interest to determine membership of new variables in that
class. Once a source is identified as variable, its location in
color-luminosity space can often provide overwhelming evidence of class
membership (Figure 1.4). Hertzsprung-Russell (H-R) diagrams obviously require
the distance to the source to be known accurately and so, until Gaia [50],
their utility is restricted to sources with parallaxes previously measured by
the Hipparcos survey. For some sources, such as RR Lyrae and quasars, location
in color-color space suffices to provide probable classification (Figure 1.5).
Strict color cuts or more general probabilistic decisions on clusterings
within a certain color-color space can be performed. (Regardless, reddening
and contamination often make such classification both inefficient and impure.)

5 For example, Type Ia supernovae, likely due to the explosion of a white
dwarf, appear qualitatively similar in their light curves to some
core-collapse supernovae from hydrogen-stripped massive stars (Type Ib/Ic).
Yet the presence or absence of silicon in the spectra became the defining
observation that led to very different physical inferences for similar
phenomenological types of supernovae.

Of considerable interest, given that historical and/or simultaneous color 
information is not always available and that unknown dust can affect color- 
based classification, is to classify using time-series data alone. For some do- 
mains, the light curves tell much of the story. Well before the peak brightness 
in a microlensing event, for example, an otherwise quiescent star will appear 
to brighten monotonically like a second-order power-law in time. By contin- 
uously fitting the light curve of (apparently) newly variable stars for such a 
functional form, a statistically rigorous question can be asked about whether 
that event appears to be microlensing or not. For a sufficiently-homogeneous 
class of variables, an empirical light curve can be fit to the data and those 
sources with acceptable fits can be admitted to that class. This was done 
to discover and classify RR Lyrae stars in the SDSS Stripe 82 dataset [59].
Such approaches require, implicitly, a threshold of acceptability. However, 
using cuts based on model probabilities and goodness-of-fit values can be 
damaging: these metrics are often a poor description of class probabilities 
due to the overly-restricted space of template models under consideration 
as well as other modeling over-simplifications. A better approach is to use 
a representative training set of sources with known class to estimate the 
ROC curve for the model fits, and to then pick the threshold value corre- 
sponding with the desired efficiency and purity of the sample. If the training 
set is truly representative, this ensures a statistical guarantee of the class 
efficiency and purity of samples generated by this approach. 

A related, but less strong, statement can often be made that a source
has "class-like" variability. For example, there is no one template of
a quasar light curve but, since quasars are known to vary stochastically like
a damped random walk, with some characteristic timescale that correlates
only mildly with luminosity, it is possible to capture the notion of whether
a given light curve is statistically consistent with such behavior. In Butler
& Bloom [16] we created a set of features designed to capture how much a
variable was "quasar like" and found a high degree of efficiency and purity
of quasar identification based on a spectroscopic validation sample (Figure
1.6). Some variable stars, such as pulsating supergiants and X-ray binaries,
also show this QSO-like behavior; so it is clear that such domain-specific
statistical features alone cannot entirely separate classes.
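
To make the damped random walk concrete, the sketch below simulates one at irregular observation times using the exact transition rule of an Ornstein-Uhlenbeck process. It only generates mock light curves and is not the feature set of [16]; the function name `simulate_drw` and the parameter values (timescale `tau` in days, long-term scatter `sigma` in magnitudes) are illustrative assumptions.

```python
import math
import random

def simulate_drw(times, tau, sigma, mean=0.0, seed=0):
    """Simulate a damped random walk sampled at the given (sorted) times.

    tau:   damping (decorrelation) timescale, same units as times.
    sigma: long-term standard deviation of the process.
    """
    rng = random.Random(seed)
    # First point is drawn from the stationary distribution.
    values = [rng.gauss(mean, sigma)]
    for t_prev, t_next in zip(times, times[1:]):
        decay = math.exp(-(t_next - t_prev) / tau)
        # Conditional distribution of the next value given the current one.
        cond_mean = mean + (values[-1] - mean) * decay
        cond_sigma = sigma * math.sqrt(1.0 - decay ** 2)
        values.append(rng.gauss(cond_mean, cond_sigma))
    return values

# Irregularly sampled epochs (days), as is typical for survey cadences.
times = [0.0, 1.0, 3.5, 4.0, 10.0, 30.0, 31.0, 90.0]
mags = simulate_drw(times, tau=200.0, sigma=0.2, mean=19.0, seed=42)
for t, m in zip(times, mags):
    print(f"t={t:6.1f} d  mag={m:.3f}")
```

Epochs that are close together relative to `tau` are strongly correlated, while widely separated epochs are nearly independent; a test for "quasar-like" behavior asks whether an observed light curve is consistent with this covariance structure.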



6 Such classification decisions can make use of the empirical distribution of
sources within a class and uncertainties on the data for a given instance [10].

Figure 1.4: Fractional variability of stars across the H-R diagram derived
from Hipparcos data. Red indicates significant variability and blue
low-amplitude variability (10% peak-to-peak). Identification of colors coupled
with distances provides a rather clean path to classification. From [27].

Figure 1.5: Color-color plot showing variable sources from Stripe 82. Region
II is the traditional QSO locus and Region IV is the region populated by
most RR Lyrae. There are clearly many QSOs that fall outside Region II
(particularly high-redshift QSOs), some of which are in the RR Lyrae region.
From [16].

Figure 1.6: Variability selection of quasars. Using a Bayesian framework
to connect light curves of point sources to damped random walk behavior,
statistics that account for uncertainty and covariance can be developed to
find QSO-like behavior. This selection (green line) is highly efficient at
finding known QSOs (~99%) and impure at the 3% level. From [16].



There is significant utility in restricting the model fits to a finite number 
of classes. Indeed, one of the more active areas of domain-specific
classification is in supernova subclassing. By assuming that a source is some
sort of supernova, a large library of well-observed supernova light curves (and
photometric colors) can be used to infer the sub-type of a certain instance, 
especially when quantifying the light curve trajectory through color-color 
space [51]. Provided that the library of events spans (and samples) suffi- 
ciently well the space of possible subclasses (and making use of available 
redshift information to transform templates appropriately), Bayesian odds 
ratios can be effectively used to determine membership within calibrated 
confidence levels (see ref. [4"S]). 
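
The odds-ratio idea can be illustrated with a deliberately simplified sketch: two fixed template light curves, Gaussian photometric errors, and no marginalization over nuisance parameters such as redshift, time of maximum, or extinction, which a real analysis would need. The template values, class names, and priors below are invented toy numbers and do not come from any supernova library.

```python
import math

def log_likelihood(obs, errs, template):
    """Gaussian log-likelihood of observed magnitudes given a template."""
    return sum(-0.5 * ((o - t) / e) ** 2 - math.log(e * math.sqrt(2.0 * math.pi))
               for o, e, t in zip(obs, errs, template))

def posterior_probs(obs, errs, templates, priors):
    """Posterior class probabilities over a finite template library."""
    logp = {name: math.log(priors[name]) + log_likelihood(obs, errs, tmpl)
            for name, tmpl in templates.items()}
    peak = max(logp.values())  # subtract the peak for numerical stability
    unnorm = {name: math.exp(v - peak) for name, v in logp.items()}
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}

# Toy templates sampled at the same four epochs as the data (magnitudes).
templates = {"Ia": [18.0, 17.5, 17.8, 18.6],
             "IIP": [18.0, 17.8, 17.8, 17.9]}
priors = {"Ia": 0.5, "IIP": 0.5}

obs = [18.05, 17.55, 17.75, 18.50]
errs = [0.1, 0.1, 0.1, 0.1]

probs = posterior_probs(obs, errs, templates, priors)
odds = probs["Ia"] / probs["IIP"]
print(probs)
print(f"odds Ia:IIP = {odds:.3g}")
```

Restricting the library to a finite set of classes is what makes the normalization possible; the probabilities are only meaningful if the true class is actually represented among the templates.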
1.2.2 Feature-based Classification

An abstraction from domain-specific classification (such as template fitting)
is to admit that the totality of the available data reflects the true
classification, irrespective of whether we understand the origin of that
variability or have quantified specifically what it means to belong to a
certain class. We classify on features: metrics derived from time-series and
contextual data. There are a number of practical advantages to this
transformation of the data. First, feature creation allows heterogeneous data
to be mapped to a more homogeneous m-dimensional real-valued space. In this
space, instances of variable objects collected from different instruments with
different cadences and sensitivities can be directly intercompared. This is
the sort of space where machine-learning algorithms work well, allowing us to
bring to bear the richness of the machine-learning literature on astronomical
classification. Second, features may be arbitrarily simple (e.g., the median
of the data) or complex. So in cases with only limited data availability
(when, for instance, light curve fitting might fail) we have a subset of
metrics that can still be useful in classification. Many machine-learning
frameworks have prescriptions for dealing with missing data that do not bias
the results. Third, many feature-based classification methods produce class
probabilities for each new source, and there are well-prescribed methods in ML
both for calibrating the classification results and for avoiding overfitting.
Last, ML approaches allow us to explicitly encode the notion of loss (or
"cost") in the classification process, allowing for a controlled approach to
setting the efficiency and purity of the final results.

There is, of course, a huge space of possible features and many will be 
significantly related to others (e.g., mean and median will strongly correlate). 
One of the interesting advantages of some ML techniques is the classification 
robustness both in the face of feature covariance and "useless" features. 
This is freeing, at some level, allowing us to create many feature generators 
without worry that too many kitchen sinks will sink the boat. The flip side, 
however, is that there are always more features on the horizon than those in 
hand that could be incrementally more informative for a certain classification 
task. 

Methods for feature-based classification of time-varying sources in as- 
tronomy come in one of two flavors. The first are supervised methods, 
which use both the features and previously-known class labels from a set of 
training data to learn a mapping from feature to class space. The second are 
unsupervised methods (also called statistical clustering), which do not use 
class labels and instead seek to unveil clustering of the data in feature space.
The end goals of these approaches are different: supervised classification
attempts to build an accurate predictive model where, for new instances, the
true classes (or class probabilities) can be predicted with as few errors as
possible, whereas unsupervised classification seeks a characterization of the
distribution of features, such as estimating the number of groups and
allocating the data points into those groups. A common technique (e.g., [26])
is to blend the two by first performing unsupervised classification and
subsequently analyzing the resultant clusters with respect to a
previously-known set of class labels.

Feature Creation 

The two broad classes of features, time-domain and context, each provide 
unique value to classification but also pose unique challenges. The most
straightforwardly calculated time-domain features are based on the
distribution of detected fluxes, such as the various moments of the data (mean,
skewness, kurtosis). Variability metrics, such as χ² under an unchanging
brightness hypothesis and the so-called Stetson variability quantities [62], 
are easily derived and make use of photometric uncertainties. Quantile-based 
measurements (such as the fraction of data observed between certain flux 
ranges) provide some robustness to outliers and provide a different view of 
the brightness distribution than moments. Inter-comparisons (e.g., ratios) 
of these metrics across different filters may themselves be useful metrics. 
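As a rough illustration of the distribution-based features just described, the sketch below computes moments, a reduced $\chi^2$ against a constant-brightness model, and one quantile-based ratio for a single-band light curve. The function name, the dictionary keys, and the particular quantile ratio are illustrative choices, not definitions taken from the cited work.

```python
import statistics


def flux_features(flux, flux_err):
    """Distribution-based features for a single-band light curve.

    `flux` and `flux_err` are equal-length lists; the returned keys are
    illustrative names, not a standard schema.
    """
    n = len(flux)
    mean = statistics.fmean(flux)
    std = statistics.pstdev(flux)

    # Standardized third and fourth moments (set to zero for a constant curve).
    if std > 0:
        skew = sum(((f - mean) / std) ** 3 for f in flux) / n
        kurt = sum(((f - mean) / std) ** 4 for f in flux) / n - 3.0
    else:
        skew = kurt = 0.0

    # Reduced chi^2 of the constant-brightness hypothesis, evaluated at the
    # inverse-variance weighted mean flux.
    weights = [1.0 / e ** 2 for e in flux_err]
    wmean = sum(w * f for w, f in zip(weights, flux)) / sum(weights)
    chi2_red = sum(((f - wmean) / e) ** 2
                   for f, e in zip(flux, flux_err)) / (n - 1)

    # Quantile-based spread: width of the central 50% of fluxes relative to
    # the central 90%, which is insensitive to a few outlying points.
    q = statistics.quantiles(flux, n=20)  # cut points at 5%, 10%, ..., 95%
    wide = q[18] - q[0]
    mid_ratio = (q[14] - q[4]) / wide if wide > 0 else 0.0

    return {"mean": mean, "skew": skew, "kurtosis": kurt,
            "chi2_red": chi2_red, "mid50_over_mid90": mid_ratio}
```

Applying the same function to each filter and taking ratios of the outputs would give the cross-filter comparisons mentioned above.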

Time-ordered metrics retain phase information. Frequency analysis, finding
significant periodicity in the data, provides powerful input to the
classification of variable stars (Figure 1.7). There are significant limitations
to frequency-domain features, the most obvious of which is that a lot of
time-series data is required to make meaningful statements: with three epochs
of data, it makes no sense to ask what the period of the source is. Even in the
limit that a frequency of interest ($f_0$) is potentially sampled well in a
Nyquist sense (where the total time duration of the light curve is longer than
$\sim 2/f_0$), the particular cadence of the observations may strongly alias
the analysis, rendering significance measurements on peaks in the periodogram
intractable. And unless the sources are regularly sampled (which, in general,
they are not) there will be covariance across the power spectrum. Finding
significant periods can mean fitting a small amount of data over millions of
trial frequencies, resulting in frequency-domain features that are
computationally expensive^7. We review techniques and hybrid prescriptions for
period finding and analysis in Richards et al. [54].

7 One practical approach for data observed with similar cadences is to compute the
periodogram at a small number of frequencies from a fixed set and set the power/significance

Figure 1.7: Distribution of two frequency-domain features derived for 25
classes of variable stars from OGLE and Hipparcos photometry: a) log of the
most significant frequency (units of day$^{-1}$) from a generalized Lomb-Scargle
periodogram analysis, and b) the log of the amplitude of the most significant
period, in units of magnitude. Mira variables (top) are long-period,
high-amplitude variables, while delta Scuti stars (10th from top) are
short-period, low-amplitude variables. Aperiodic sources, such as S Doradus
stars (5th from bottom), have a large range in effective dominant period.
From [54].
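To make the two frequency-domain features of Figure 1.7 concrete, here is a minimal sketch that scans a grid of trial frequencies with the classical (Scargle-normalized) Lomb-Scargle periodogram and returns the peak frequency and the fitted sinusoid amplitude. It is not the generalized Lomb-Scargle variant used in [54]; the function name, the trial grid, and the synthetic light curve are assumptions made for illustration, and the light curve is assumed to be non-constant.

```python
import math
import random


def lomb_scargle_features(t, y, freqs):
    """Return (best_frequency, amplitude, peak_power) from a classical
    Lomb-Scargle periodogram evaluated on the trial grid `freqs`."""
    n = len(t)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    var = sum(v * v for v in yc) / (n - 1)

    best = None
    for f in freqs:
        w = 2.0 * math.pi * f
        # Time offset tau that makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                         sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        yc_c = sum(a * b for a, b in zip(yc, c))
        yc_s = sum(a * b for a, b in zip(yc, s))
        cc = sum(a * a for a in c)
        ss = sum(a * a for a in s)
        if cc == 0.0 or ss == 0.0:
            continue  # degenerate sampling at this trial frequency
        power = (yc_c ** 2 / cc + yc_s ** 2 / ss) / (2.0 * var)
        if best is None or power > best[2]:
            # Least-squares sinusoid amplitude at this frequency.
            amplitude = math.hypot(yc_c / cc, yc_s / ss)
            best = (f, amplitude, power)
    return best


# Synthetic, irregularly sampled sinusoid with frequency 0.5 per day.
rng = random.Random(0)
t_obs = sorted(rng.uniform(0.0, 50.0) for _ in range(100))
y_obs = [math.sin(2.0 * math.pi * 0.5 * ti) for ti in t_obs]
trial = [0.01 * k for k in range(1, 201)]
f_best, amp, peak_power = lomb_scargle_features(t_obs, y_obs, trial)
```

The cost is proportional to the number of epochs times the number of trial frequencies, which is why dense frequency grids over many sources become expensive.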

Other time-ordered features may be extracted using a notion of "distance"
between a given instance of a light curve and all others. For instance,
to derive features useful for supernova typing, Richards et al. [53] built up
a matrix of distances between each pair of SNe (including both
labeled and unlabeled instances) based on interpolating spline fits to the
time-series measurements in each photometric band. The pairwise distance
matrix was subsequently fed into a diffusion map algorithm that embeds the
set of supernovae in an optimal, low-dimensional feature space, separating
out the various SN subtypes (Figure 1.8). In a variable star analysis, Deb
& Singh [20] use the covariance matrix of a set of interpolated, folded light
curves to find features using PCA. In addition to using distance-based
features to capture the time variability of sources, the way in which flux
changes in time can be captured by fitting parameters under the assumption that
the data are due to a Gaussian process [52].
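The first steps of such a distance-based pipeline can be sketched as follows. This is a simplified stand-in, not the procedure of [53]: it uses piecewise-linear interpolation instead of spline fits, a single band, a plain Euclidean distance, and a hypothetical kernel width `eps`. It stops at the row-stochastic diffusion operator; the diffusion-map coordinates would come from an eigen-decomposition of that matrix, which is omitted here.

```python
import math


def interpolate(t, y, grid):
    """Piecewise-linear interpolation of a light curve onto `grid`.
    `t` must be strictly increasing; values outside its range are clamped."""
    out = []
    for g in grid:
        if g <= t[0]:
            out.append(y[0])
        elif g >= t[-1]:
            out.append(y[-1])
        else:
            j = next(i for i in range(1, len(t)) if t[i] >= g)
            frac = (g - t[j - 1]) / (t[j] - t[j - 1])
            out.append(y[j - 1] + frac * (y[j] - y[j - 1]))
    return out


def distance_matrix(curves, grid):
    """Euclidean distance between every pair of interpolated light curves."""
    resampled = [interpolate(t, y, grid) for t, y in curves]
    return [[math.dist(a, b) for b in resampled] for a in resampled]


def diffusion_operator(dist, eps):
    """Row-stochastic Markov matrix built from a Gaussian kernel on distances."""
    kernel = [[math.exp(-d * d / eps) for d in row] for row in dist]
    return [[k / sum(row) for k in row] for row in kernel]


# Three toy light curves: two share a shape, the third is flat.
curves = [([0.0, 2.0, 4.0], [0.0, 2.0, 0.0]),
          ([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 1.0, 0.0]),
          ([0.0, 4.0], [1.0, 1.0])]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
D = distance_matrix(curves, grid)
P = diffusion_operator(D, eps=2.0)
```

Because labeled and unlabeled curves enter the distance matrix on equal footing, the resulting coordinates can be computed before any class labels are consulted.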

7 (cont.) at each of these frequencies to be separate features. Covariance is then
implicitly dealt with at the ML level, rather than at the feature generation
level (e.g., ref. [26]).

Figure 1.8: Light curve distance measures can be used in conjunction with
spectral methods, such as diffusion map, to compute informative features.
In this example, a spline-based distance between supernova light curves,
designed to capture both shape and color differences, was used. In the first
two diffusion map coordinates (left), Type Ia and II SNe are distinguished,
whereas higher features (right) reveal some separation between Ia and Ib/c
supernovae. From [53].

We define context-specific features as being all derivable features that
are not expected to change in time. The location of the event on the sky,
in Galactic or ecliptic coordinates, obviously provides a strong indication
of whether the event has occurred in the Galaxy or in the Solar System.
Metrics on the distance to the nearest detected galaxy and the parameters 
of that galaxy (its color, size, inferred redshift, etc.) are crucial features for 
determining the nature of extragalactic events. Even with very little
time-domain data a strong classification statement can be made: for example,
an event well off the ecliptic plane that occurs on the apparent outskirts
of a red, spiral-less galaxy is almost certainly a Type Ia supernova. One
of the main challenges with context features is the heterogeneity of the 
available data. For example, in some places on the sky, particularly in the 
SDSS footprint, much is known about the stars and galaxies near any given 
position. Outside such footprints, context information may be much more 
limited. From a practical standpoint, if context information is stored only in 
remotely queryable databases, what information is available and the time it 
takes to retrieve that information may be highly variable in time. This can 
seriously affect the computation time to produce a classification statement 
on a given place on the sky. 
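A minimal sketch of one such context feature, the offset from the nearest catalogued galaxy, is given below. The catalogue layout (a list of dictionaries with `ra`, `dec`, `radius_arcsec`, `color`, and `redshift` keys) is hypothetical, and returning `None` for an empty catalogue is one possible way to represent the missing-context case discussed above.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def nearest_galaxy_features(ra, dec, galaxies):
    """Context features from the closest catalogued galaxy, or None if the
    catalogue has no entries (e.g., outside a survey footprint)."""
    if not galaxies:
        return None
    seps = [angular_separation_arcsec(ra, dec, g["ra"], g["dec"])
            for g in galaxies]
    i = min(range(len(galaxies)), key=seps.__getitem__)
    host = galaxies[i]
    return {"sep_arcsec": seps[i],
            # Offset in units of the galaxy's apparent radius.
            "sep_in_radii": seps[i] / host["radius_arcsec"],
            "host_color": host["color"],
            "host_redshift": host["redshift"]}


# Invented two-galaxy catalogue and one event position.
catalogue = [
    {"ra": 10.0, "dec": 0.0, "radius_arcsec": 20.0, "color": 1.2, "redshift": 0.05},
    {"ra": 10.01, "dec": 0.0, "radius_arcsec": 10.0, "color": 0.4, "redshift": 0.12},
]
context = nearest_galaxy_features(10.009, 0.0, catalogue)
```

A classifier that tolerates missing feature values can then consume the `None` case directly rather than waiting on a slow or unavailable remote query.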

Supervised Approaches 

Using a sample of light curves whose true class membership is known (e.g.,
via spectral confirmation), supervised classification methods learn a
statistical model (known as a classifier) to predict the class of each
newly-observed light curve from its features. These methods are constructed to
maximize the predictive accuracy of the classifications of new sources. The
goal of these approaches is clear: given a set of previously-labeled variables,
make the best guess of the label of each new source (and optionally find the
sources that do not fit within the given label taxonomy). Many supervised
classification methods also predict a vector of class probabilities for each
new source. These probabilistic classifiers can be used to compute ROC curves
for the selection of objects from a specified science class (such as those in
Figure 1.9), from which the optimal probability threshold can be chosen to
create samples with desired purity and efficiency.
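The threshold selection described above can be sketched as a scan over candidate probability cuts. Here purity is the fraction of selected objects that truly belong to the science class and efficiency is the fraction of true class members that are selected; the function names and the toy numbers are illustrative assumptions.

```python
def purity_efficiency(probs, is_target, threshold):
    """Purity and efficiency of the sample selected with P(class) >= threshold.
    `is_target` holds 1 for true members of the science class, else 0."""
    selected = [truth for p, truth in zip(probs, is_target) if p >= threshold]
    true_pos = sum(selected)
    efficiency = true_pos / sum(is_target)
    purity = true_pos / len(selected) if selected else 1.0
    return purity, efficiency


def threshold_for_purity(probs, is_target, min_purity):
    """Scan the observed probabilities as candidate thresholds and return the
    (threshold, purity, efficiency) with the highest efficiency that still
    meets `min_purity`, or None if no threshold qualifies."""
    best = None
    for thr in sorted(set(probs)):
        purity, eff = purity_efficiency(probs, is_target, thr)
        if purity >= min_purity and (best is None or eff > best[2]):
            best = (thr, purity, eff)
    return best


# Toy classifier output for six sources (invented values).
toy_probs = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
toy_truth = [1, 1, 0, 1, 0, 0]
strict = threshold_for_purity(toy_probs, toy_truth, min_purity=0.99)
loose = threshold_for_purity(toy_probs, toy_truth, min_purity=0.7)
```

In practice the scan should be run on held-out or cross-validated probabilities, since thresholds tuned on the training set will look better than they are.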

There are countless classification methods in the statistics and
machine learning literature. Our goal here is to review a few methods that
are commonly used for supervised classification of time-variable sources in 
astronomy. 

Figure 1.9: ROC curves for the selection of supernovae from a random forest
probabilistic classifier, using data from the SN Photometric Classification
Challenge [36]. Left: For classification of Type Ia SNe, in the spectroscopic
sample we can achieve 95% efficiency at 99% purity or > 99% efficiency at
98% purity, depending on the threshold. Right: For Type II-P supernovae,
the classifier performs even better, with higher efficiency at each given purity
level. From [53].

If the class-wise distributions of features were all completely known
(along with the class prior proportions), then for a new source we would use
Bayes' rule to compute the exact probability that the source is from each
class, and classify the source as belonging to the class of maximal
probability. This is referred to as the Bayes classifier, and is provably the
best possible classifier in terms of error rate. In practice, however, we do
not know the class-wise feature distributions perfectly. Many methods attempt
to estimate the class densities from the training data. In Kernel Density
Estimation (KDE) classification, the class-wise feature distributions are
estimated using a non-parametric kernel smoother. This approach has been used
to classify supernova light curves [46]. A pitfall of this technique is the
tremendous difficulty in estimating accurate densities in high-dimensional
feature spaces via non-parametric methods (this is referred to as the curse of
dimensionality). To circumvent this problem, Naive Bayes performs class-wise
KDE on one feature at a time, assuming zero covariance between features. Though
this simplifying assumption is unlikely to be true, Naive Bayes has enjoyed much
use, including in time-domain science [41]. A step up from Naive Bayes is
Bayesian Network classification, which assumes a sparse, graphical
conditional dependence structure amongst the features. This approach was used
with considerable success for variable star classification [21], [22].
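As a sketch of the Naive Bayes variant described above, the code below estimates each class-conditional density one feature at a time with a Gaussian kernel and combines them with the class priors via Bayes' rule. The fixed bandwidth `h`, the class names, and the toy feature values (nominally log frequency and log amplitude) are invented for illustration.

```python
import math


def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm


def naive_bayes_kde(train, x_new, h=0.3):
    """Class posteriors for `x_new`.

    `train` maps each class label to a list of feature vectors. Features are
    treated as independent within a class, so the class-conditional density
    is a product of one-dimensional KDEs.
    """
    n_total = sum(len(rows) for rows in train.values())
    joint = {}
    for label, rows in train.items():
        p = len(rows) / n_total  # class prior from training proportions
        for j, xj in enumerate(x_new):
            p *= kde(xj, [r[j] for r in rows], h)
        joint[label] = p
    z = sum(joint.values())
    return {label: p / z for label, p in joint.items()}


# Invented toy training set: (log frequency, log amplitude) per source.
toy_train = {
    "mira": [(-2.5, 0.5), (-2.4, 0.4), (-2.6, 0.6)],
    "delta_scuti": [(1.0, -1.5), (1.1, -1.4), (0.9, -1.6)],
}
posterior = naive_bayes_kde(toy_train, (-2.45, 0.45))
```

A full (non-naive) KDE classifier would replace the product of one-dimensional estimates with a single multivariate density per class, which is where the curse of dimensionality enters.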

Alternatively, class-wise distributions can be estimated using parametric
models. The Gaussian Mixture classifier assumes that the feature
distribution from each class follows a multivariate Gaussian distribution, where
the mean and covariance of each distribution are estimated from the training
data. This approach is used widely in variable star classification (e.g., [21],
[22], and [7]). The advantage of this parametric approach is that it does not
suffer from the curse of dimensionality. However, if the data do not really
follow a mixture of multivariate Gaussian distributions, then predictions may be
inaccurate: for example, we showed in [54] that using the same set of variable
star features, a random forest classifier outperforms the Gaussian mixture
classifier by a statistically significant margin. Gaussian mixture classifiers
are also called Quadratic Discriminant Analysis (QDA) classifiers (or
Linear Discriminant Analysis, LDA, if pooled covariance estimates are
used). These names refer to the type of boundaries that are induced between
classes in feature space.
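The origin of these names can be seen from the standard Gaussian discriminant functions. The notation below is introduced here for illustration and does not appear in the original text: $\pi_k$ is the prior proportion of class $k$, $\mu_k$ and $\Sigma_k$ are its estimated mean and covariance, and $\Sigma$ is the pooled covariance.

```latex
\delta_k(x) = \log \pi_k - \tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
\qquad \text{(QDA)}

\delta_k(x) = \log \pi_k + x^{\top}\Sigma^{-1}\mu_k
              - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k
\qquad \text{(LDA, pooled } \Sigma\text{)}
```

A source with feature vector $x$ is assigned to the class with the largest $\delta_k(x)$. The boundary between classes $j$ and $k$ is the set where $\delta_j(x) = \delta_k(x)$, which is quadratic in $x$ in the first case and linear in the second.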

Indeed, many classification methods instead focus on locating the
optimal class boundaries. Support Vector Machines (SVMs) find the
maximum-margin hyperplane to separate instances of each pair of classes.
Kernelization of an SVM can easily be applied to find non-linear class
boundaries. This approach has been used to classify variable stars in a number
of recent papers [21], [66], [54]. The K-nearest neighbors (KNN) classifier
predicts the class of each object by a vote among its K nearest neighbors in
feature space, thereby implicitly estimating the class decision boundaries
non-parametrically. Another popular method is Classification Trees, which
performs recursive binary partitioning of the feature space to arrive at a
set of pure, disjoint regions. Trees are powerful classifiers because they can
capture complicated class boundaries, are robust to outliers, are immune to
irrelevant features, and easily cope with missing feature values. Their
drawback is that due to their hierarchical nature, they tend to have high
variance with respect to the training set. Tree ensemble methods, such as
Bagging, Boosting, and Random Forest, overcome this limitation by fitting many
classification trees to bootstrapped versions of the training data and
averaging their results. Boosting, which has been used by Newling et al. [IB]
for SN classification and Richards et al. [54] for variable star classification,
iteratively reweights the training examples to increasingly focus on
difficult-to-classify sources. Random Forest, which was used by multiple
entrants in the Supernova Photometric Classification Challenge [36] and by our
group [53] for variable star classification, builds de-correlated trees by
choosing a different random subset of features for each split in the
tree-building process. In Richards et al. [53], we found that random forest was
the optimal method for a multi-class variable star problem in terms of error
rate (Figure 1.10).
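The two ingredients of a random forest, bootstrap resampling and random feature selection, can be illustrated with a deliberately reduced sketch. The trees here are depth-one "stumps" with a single randomly chosen feature each, whereas a real random forest grows deep trees and draws a fresh random feature subset at every split. All names and the toy data are illustrative.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Best single-threshold split on one feature, by misclassification count.
    Returns (feature, threshold, left_label, right_label)."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = 0.5 * (lo + hi)
        left = [c for row, c in zip(X, y) if row[feature] <= thr]
        right = [c for row, c in zip(X, y) if row[feature] > thr]
        l_lab, l_n = Counter(left).most_common(1)[0]
        r_lab, r_n = Counter(right).most_common(1)[0]
        errors = len(y) - l_n - r_n
        if best is None or errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    if best is None:  # the feature is constant in this bootstrap sample
        majority = Counter(y).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def fit_forest(X, y, n_trees=25, seed=0):
    """Fit `n_trees` stumps, each on a bootstrap resample and one random feature."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        feature = rng.randrange(d)                  # random feature choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_proba(forest, x):
    """Class probabilities as the fraction of trees voting for each class."""
    votes = Counter(l if x[f] <= thr else r for f, thr, l, r in forest)
    return {label: n / len(forest) for label, n in votes.items()}


# Two well-separated toy classes in a two-dimensional feature space.
X_toy = ([[0.1 * i, 0.1 * i] for i in range(10)]
         + [[10 + 0.1 * i, 10 + 0.1 * i] for i in range(10)])
y_toy = ["A"] * 10 + ["B"] * 10
forest = fit_forest(X_toy, y_toy)
```

The vote fractions returned by `predict_proba` are the kind of class-probability vector that feeds the purity and efficiency selection discussed earlier in this section.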

In time-domain classification problems, we often have a well-established
hierarchical taxonomy of classes, such as the variable star taxonomy in
Figure 1.11. Incorporating a known class hierarchy into a classification engine
is a research field that has received much recent attention in the machine
learning literature (e.g., [61]). Several attempts at hierarchical
classification have been made in variable star problems. Debosscher et al. [22]
use a 2-stage Gaussian mixture classifier, first classifying binaries versus
non-binaries, while Blomme et al. [7] use a multi-stage hierarchical taxonomy.
In Richards et al. [53], we use two methods for hierarchical classification,
both using random forest and the taxonomy in Figure 1.11.
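One generic way to use a class hierarchy, shown here only as a sketch and not as either of the two methods of [53], is to chain probabilities down the tree: a top-level classifier supplies the probability of each branch, a per-branch classifier supplies the probability of each leaf given that branch, and the leaf probability is their product. The branch names follow the top level of the taxonomy described in the text; the leaf assignments and all numbers are invented.

```python
def leaf_probabilities(p_branch, p_leaf_given_branch):
    """Combine branch and within-branch probabilities into leaf probabilities.

    p_branch: dict branch -> P(branch | features)
    p_leaf_given_branch: dict branch -> dict leaf -> P(leaf | branch, features)
    """
    out = {}
    for branch, pb in p_branch.items():
        for leaf, pl in p_leaf_given_branch[branch].items():
            out[leaf] = pb * pl
    return out


# Stand-in outputs of a top-level and three second-level classifiers.
p_top = {"pulsating": 0.7, "eruptive": 0.2, "multi-star": 0.1}
p_sub = {
    "pulsating": {"Mira": 0.25, "Delta Scuti": 0.75},
    "eruptive": {"S Doradus": 1.0},
    "multi-star": {"eclipsing binary": 1.0},
}
p_leaf = leaf_probabilities(p_top, p_sub)
best_leaf = max(p_leaf, key=p_leaf.get)
```

Because each level's probabilities sum to one, the leaf probabilities do as well, so the output can be thresholded in the same way as that of a flat classifier.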

Finally, no discussion of supervised classification would be complete
without mentioning the hugely popular method Artificial Neural Networks
(ANN). Though there are several versions of ANN, in their simplest
form they are non-linear regression models that predict class as a non-linear
function of linear combinations of the input features. Drawbacks to ANN are
their computational difficulty (e.g., there are many local optima) and lack of
interpretability, and for these reasons they have lost popularity in the
statistics literature. However, they have enjoyed much success and widespread
use in astronomy. In time-domain astronomy, ANNs have been used for
variable star classification [28], [56], [21] and by one team in the SN
Classification Challenge (though the team's ANN entry fared much worse than
their random forest entry, using the same set of features).

Figure 1.10: Distribution of cross-validation error rates for several
classifiers on a mixed data set of OGLE and Hipparcos sources (see [53]). The
classifiers are divided based on the features on which they were trained;
from left to right: (1) periodic plus non-periodic features, (2) the
Lomb-Scargle features estimated by [21], (3) the Lomb-Scargle features estimated
by [53], and (4) only non-periodic features. In terms of mis-classification rate,
the random forest classifier trained on all of the features performs the best.
Classifiers considered are: classification trees (CART & C4.5 variants),
K-nearest neighbors (KNN), tree boosting (Boost), random forest (RF),
pairwise versions of CART (CART.pw), random forest (RF.pw), and boosting
(Boost.pw), pairwise SVM (SVM.pw), and two hierarchical random forest
classifiers (HSC-RF, HMC-RF). All of the classifiers plotted, except single
trees, achieve better error rates than the best classifier from [21] (dashed
line), who considered Bayesian Network, Gaussian Mixture, ANN, and SVM
classifiers.

Figure 1.11: Variable star classification hierarchy for the problem
considered in [54]. This structure can be used in a hierarchical classifier to
yield improved results. The hierarchy is constructed based on knowledge of the
physical processes and phenomenology of variable stars. At the top level, the
sources split into three major categories: pulsating, eruptive, and multi-star
systems.

Unsupervised & Semi-Supervised Approaches

Unsupervised classification (statistical clustering) methods attempt to find 
k clusters of sources in feature space. These methods do not rely on any 
previously-known class labels, and instead look for natural groupings in the 
data. After clusters are detected, labels or other significance can be affixed 
to them. In time-domain studies, these methods are useful for exploratory
work, for instance to discover the number of statistically-distinct classes
in the data or to discover outliers and anomalous groups. In the absence 
of confident training labels, an unsupervised study is a powerful way to 
characterize the distributions in the data to ultimately determine labels and 
build a predictive model using supervised classification. 

In time-domain astronomy, the most popular clustering method is Gaussian
Mixture Modeling. This method fits a parametric mixture of Gaussian
distributions to the data by maximum likelihood via the
expectation-maximization (EM) algorithm. A penalized likelihood or Bayesian
approach can be used to estimate the number of clusters present in the data. The
Autoclass method [18] is a Bayesian mixture model clustering method that
was used by Eyer & Blake [26] to cluster ASAS variable stars. Sarro et
al. [55] use another variant of Gaussian Mixture Modeling to cluster a large
database of variable stars.
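A minimal one-dimensional sketch of the EM fit and a penalized-likelihood comparison is given below. It uses the Bayesian Information Criterion (BIC) as the penalty, a deterministic quantile-based initialization, and a small variance floor; these are illustrative choices, and this is not the Autoclass algorithm. Real applications work with multivariate features and full covariance matrices.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(x, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by EM.
    Returns (weights, means, variances, bic)."""
    xs = sorted(x)
    n = len(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    mu = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    mean = sum(xs) / n
    var = [sum((v - mean) ** 2 for v in xs) / n] * k
    w = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for v in xs:
            p = [w[j] * normal_pdf(v, mu[j], var[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: update weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component; leave its parameters unchanged
            w[j] = nj / n
            mu[j] = sum(r[j] * v for r, v in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (v - mu[j]) ** 2
                             for r, v in zip(resp, xs)) / nj, 1e-6)
    loglik = sum(math.log(sum(w[j] * normal_pdf(v, mu[j], var[j])
                              for j in range(k))) for v in xs)
    # 3k - 1 free parameters: k means, k variances, k - 1 weights.
    bic = -2.0 * loglik + (3 * k - 1) * math.log(n)
    return w, mu, var, bic


# Toy feature values forming two well-separated groups.
toy = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
w1, mu1, var1, bic1 = fit_gmm_1d(toy, 1)
w2, mu2, var2, bic2 = fit_gmm_1d(toy, 2)
```

The model with the lower BIC is preferred. Maximum-likelihood mixtures can degenerate when a component collapses onto a single point, which is one reason penalized or Bayesian treatments are used to choose the number of clusters.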

Self-Organizing Maps (SOM) is another popular unsupervised learn- 
ing method in time-domain astronomy. This method aims to map the high- 
dimensional feature vectors down to a discretized two-dimensional coordi- 
nate plane for easy visualization. SOM is the unsupervised analog of ANN 
that uses a neighborhood function to preserve the topology of the input 
feature space. This method has been used previously [11], [69] to obtain
two-dimensional parametrization of astronomical light curves. In those studies,
SOM was performed prior to visual analysis of the labeled sources in this 
space. This class of approach, where available class labels are ignored to ob- 
tain a simple parametrization of the light curve features and subsequently 
used in a learning step, is called semi-supervised learning. The advantage to 
this technique is that, if the relevant class information is preserved by the 
unsupervised step, then supervised classification will be easier in the reduced 
space. Semi-supervised classification permeates the time-domain astronomy 
literature. In addition to the aforementioned SOM studies, other authors
have used PCA [66], [20] and diffusion map [53] to parametrize time-variable
sources prior to classification. Of these studies, only Richards et al. [53] used
a rigorous statistical classifier.

1.3 Future Challenges

For any finite collection of photons, our knowledge of the true flux is in- 
herently uncertain. This basic phenomenological uncertainty belies an even 
greater uncertainty in the physical origin of what we think we are witness- 
ing. As such, any classification scheme of a given variable or transient source 
must be inherently probabilistic in nature. We have outlined how — with an 
emerging influence of the machine-learning literature — we can gain traction 
on the probabilistic classification challenge. Calibrating (and validating) the 
output probabilities from machine-learning frameworks is still a nascent en- 
deavor. 
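One simple way to check calibration, sketched here under the assumption that a labeled validation set is available, is to bin the predicted probabilities and compare the mean prediction in each bin with the observed fraction of true class members. The Brier score summarizes the same comparison in a single number. The function names and bin count are illustrative.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Rows of (mean predicted probability, observed fraction, count) for each
    non-empty probability bin. `outcomes` holds 1 for class members, else 0."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac = sum(o for _, o in b) / len(b)
            table.append((mean_p, frac, len(b)))
    return table


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

For a well-calibrated classifier the first two columns of the table agree to within the sampling noise implied by the bin counts.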

Feature generation is obviously a key ingredient to classification and we
have presented evidence that random forest classifiers are particularly adept
at exploiting the features that are most relevant to classification and at
skirting the problem of large covariance between features. On the positive side, this frees
us from having to create a small set of perfectly tuned features. However, how 
do we know when we have exhausted the range of reasonable feature space 
for classification? Our suspicion is that expert knowledge has already imbued 
the feature creation process with much of the knowledge implicitly needed 
for classification: we know for instance that phase offset between the first 
and second most dominant periods can be a powerful way to distinguish two 
closely related classes of pulsational variables. There may be information- 
theoretic (and feature-agnostic) answers to this question, which might be 
attacked with some genetic programming framework. 

On statistical grounds, implicit in the feature generation procedure is 
that the distribution of features (and their covariances) on the training set 
will be similar to the set of instances of sources we wish to classify. A gross 
mismatch of the characteristics of these two sets is likely to be a significant 
problem for the robustness of the classification statements. No study to date 
has looked at how we can use the knowledge gleaned from one survey and 
apply that to classification in another. For instance, if a classifier is blindly 
trained on one survey to classify objects from another, then it will achieve 
sub-optimal results by not considering differences in feature distribution 
between the surveys. Ideas from statistics, such as importance sampling, 
can be exploited to account for these differences. 
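The importance-sampling idea can be sketched for a single feature: each training source is weighted by the ratio of the target-survey density to the training-survey density at its feature value, so that training-set averages (such as a cross-validated error rate) better reflect the target survey. Modeling both densities as single Gaussians is a strong simplifying assumption made only to keep the example short, and the function names are illustrative.

```python
import math


def gaussian_fit(xs):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var


def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def importance_weights(train_x, target_x):
    """Weights proportional to p_target(x) / p_train(x) for each training
    source, normalized to have mean one."""
    mu_s, var_s = gaussian_fit(train_x)
    mu_t, var_t = gaussian_fit(target_x)
    w = [gaussian_pdf(x, mu_t, var_t) / gaussian_pdf(x, mu_s, var_s)
         for x in train_x]
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]


def weighted_error_rate(errors, weights):
    """Error rate in which each training source counts by its weight.
    `errors` holds 1 for a misclassified source, else 0."""
    return sum(e * w for e, w in zip(errors, weights)) / sum(weights)


# Toy feature values: the target survey is shifted relative to the training set.
train_feature = [0.0, 1.0, 2.0, 3.0, 4.0]
target_feature = [3.0, 4.0, 5.0]
weights = importance_weights(train_feature, target_feature)
```

Training sources that resemble the target survey receive large weights, while those in regions the target survey rarely populates are effectively ignored. The weights become unstable where the training density is very small, so in practice they are often clipped.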

As these very basic algorithmic questions are addressed, the
computational implications, using events from real surveys, will have to be
understood. Are feature creation and the application of an existing
machine-learned framework fast enough for a given data stream? How can loss
functions be embedded in computational choices at the feature and the labeling
levels? For streaming surveys, how often should the learning model be updated
with newly classified examples from the survey itself? What are the roles
of massively parallel hardware (e.g., graphics processing units) in feature
generation, learning, and classification?

Astronomical datasets have always presented novel algorithmic, compu- 
tational, and statistical challenges. With classification based on noisy and 
sometimes-spurious data, the forefront of all of these endeavors is already 
being stretched. As astronomers, expanding the machine-learning literature 
is a means to an end — if not just a way to keep our heads above water — 
building a vital set of tools for the exploration of the vast and mysterious 
dynamic universe.